ABSTRACT


The philosophical and computational foundations of a system
reliability analysis are discussed. Recent advances in network
and logic tree computational methods are reviewed. Long run
performance formulas for systems subject to preventive
maintenance are given.


SYSTEM  RELIABILITY  ANALYSIS:  FOUNDATIONS 


Richard  E.  Barlow 
Operations  Research  Center 
University  of  California 
Berkeley,  CA  94720 

1.  INTRODUCTION 

System reliability analysis problems arise in many practical engineering areas.
Some of these include communication networks, electrical power systems, water
transmission systems, nuclear power reactors, and transportation systems. We will
illustrate some of the ideas basic to a system reliability analysis via our
experience in analyzing a proposed Satellite X-ray Test Facility (SXTF). This
facility would test space satellites relative to an electromagnetic radiation
environment.

The purpose of a system reliability analysis is to acquire information about a
system of interest relative to making decisions based on considerations of
availability, reliability, and safety as well as any inherent engineering risks. The
philosophy and guidelines for a system analysis have been discussed in several
excellent introductory chapters by David Haasl in a Fault Tree Handbook (1981).
Broadly speaking, there are two important aspects to a system analysis: (1) an
INDUCTIVE ANALYSIS stage and (2) a DEDUCTIVE ANALYSIS stage. In the inductive
analysis stage we gather and organize available information on the system. We define
the system, describe its functional purpose and determine its critical components.
At this stage, we ask the question: WHAT can happen to the system as a result of a
component failure or a human error? We hypothesize and guess possible system failure
scenarios as well as system success modes. A Preliminary Hazard Analysis is often
performed at the system level. A Failure Modes and Effects Analysis is conducted at
the component level.

The DEDUCTIVE ANALYSIS aspect of a system reliability analysis answers the
question: HOW can a system fail (or succeed) or be unavailable? A logic tree (or
fault tree if we are failure oriented) is often the best device for deducing how a
major system failure event could possibly occur. However, its construction depends
on a thorough understanding of the system and the results of the system inductive
analysis. A block diagram or a network graph is a useful device for representing a
successfully functioning system. Since the network graph is close to a system
functional representation, it cannot capture abstract system failure and human error
events as well as the logic tree representation. However, from the point of view
of mathematical probability analysis, the network graph representation seems to be
correspondingly easier to analyze.


The Operations Research Center at Berkeley has completed two projects so far
involving extensive system reliability analysis of a proposed X-ray test facility.
One subsystem providing the photon source is composed of 192 individual modules.
They are attached at one end to the Marx capacitor bank, and the other end penetrates
into the vacuum chamber and terminates in the X-ray producing diode (see Figure 1).
Each module is filled with water that is separated from the oil in the Marx tank by
an epoxy diaphragm and from the vacuum chamber by a styrene insulator plate.


FIGURE 1. PROPOSED X-RAY TEST FACILITY

In the inductive phase of our system analysis we listed all possible mechanical
and electrical failure modes that we could envision. This led to a critical
components list including assessed failure rates. For each member of the list, a
detailed failure modes and effects form was filled out by engineers concerned with the
project. This, together with a detailed discussion of possible system faults,
constituted our "inductive analysis."

It is well-known that system failures often occur at subsystem interfaces. In
the deductive phase of our analysis we were most concerned with the oil-water and
also the water-vacuum interfaces. Fault trees were constructed for water leakage
from the tube into the vacuum chamber, for oil/water mixing and also for satellite
contamination. These fault trees pinpointed failure modes which might have been
otherwise overlooked. In particular, as a result of these fault trees, certain
components were redesigned to prevent potential failures. The fault trees provided
useful visual tools for describing the logic leading up to possible serious system
failure events. They provided the basis for contending that all likely critical failure
events have been found and studied. Finally, a simple block diagram of our system
was used to implement a system availability analysis. In the next section we show
how to analyze, probabilistically, more complex networks.
2. CALCULATION OF SYSTEM RELIABILITY


The logical relationship between component events and system events is best
represented by a network graph or a logic tree. A Boolean expression can be derived
from either representation which can then be used to calculate the probability of
system events of interest. However, recent research on the computational complexity
of network reliability problems has shown that Boolean computational methods are not
efficient. Chang (1981), in Chapter 3 of his Ph.D. thesis, discusses the Boolean
algebra approach and Backtrack algorithms in this regard.


Networks 


Suppose we consider a network graph representation such as the undirected
network in Figure 2.


FIGURE 2. UNDIRECTED TWO TERMINAL NETWORK (SOURCE s, TERMINAL t)


In this case, system success occurs if there is at least one working path of nodes
and arcs from source s to terminal t. Let the Boolean indicator

\[
x_i = \begin{cases} 1 & \text{if arc } i \text{ works,} \\ 0 & \text{otherwise.} \end{cases}
\]

For convenience, suppose nodes are perfect so that $(x_1, \ldots, x_8)$ is a state
vector for our network. Let

\[
\phi(x_1, x_2, \ldots, x_8) = \begin{cases} 1 & \text{if } s \text{ and } t \text{ can communicate,} \\ 0 & \text{otherwise.} \end{cases}
\]

Such systems are called coherent systems in Barlow and Proschan (1981). Basically,
$\phi$ is coherent if it is nondecreasing coordinatewise. All coherent systems, $\phi$,
can be represented as two terminal networks with possible replication of some arcs.

A minimal path set for the network in Figure 2 is $P_1 = \{1,4,7\}$, for example.
There are 8 such min path sets $\{P_1, P_2, \ldots, P_8\}$. Hence, $\phi$ can be represented as

\[
\phi(x_1, x_2, \ldots, x_8) = 1 - \prod_{r=1}^{8} \Bigl(1 - \prod_{i \in P_r} x_i\Bigr) \stackrel{\text{def}}{=} \coprod_{r=1}^{8} \prod_{i \in P_r} x_i . \tag{2.1}
\]

By expanding this expression, using the usual arithmetic (not Boolean arithmetic) and
replacing each power $x_i^k$ by $x_i$ (since $x_i$ is 0 or 1), we can obtain an
expression suitable for computing the system success probability. Assuming arc
failure events are statistically independent, we need only replace $x_i$ by $p_i$,
the probability arc i works, in the resulting expression.

However, there is a far more efficient method for doing this calculation, called
the factoring algorithm. The idea is to first perform all possible series and
parallel probability reductions and then pivot on an arc. Let
$\mathbf{p} = (p_1, p_2, \ldots, p_8)$ and let $h(\mathbf{p})$ denote the probability
that s and t communicate. If we "pivot" on arc i then we obtain the "pivotal
decomposition" of $h(\mathbf{p})$, namely

\[
h(\mathbf{p}) = p_i\, h(1_i, \mathbf{p}) + (1 - p_i)\, h(0_i, \mathbf{p}) \tag{2.2}
\]

where $(1_i, \mathbf{p}) = (p_1, p_2, \ldots, p_{i-1}, 1, p_{i+1}, \ldots, p_8)$ and
$(0_i, \mathbf{p})$ is defined analogously. This, together with series
and parallel reductions, is the mathematical basis for the factoring algorithm.
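The pivotal decomposition (2.2) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the factoring algorithm itself: the five-arc bridge network `ARCS`, the node names and the function names are hypothetical (the topology of Figure 2 is not reproduced here), and the series and parallel reductions that make the factoring algorithm efficient are omitted, so only the pivoting step is shown.

```python
# Hypothetical five-arc bridge network (an assumption; this is not Figure 2).
ARCS = [("s", "a"), ("s", "b"), ("a", "b"), ("a", "t"), ("b", "t")]


def connected(arcs, s, t):
    """Return True if the undirected arcs join node s to node t."""
    seen, stack = {s}, [s]
    while stack:
        node = stack.pop()
        for u, v in arcs:
            for x, y in ((u, v), (v, u)):
                if x == node and y not in seen:
                    seen.add(y)
                    stack.append(y)
    return t in seen


def h(arcs, probs, s="s", t="t", working=()):
    """Reliability by repeated pivoting: h = p_i*h(1_i, p) + (1 - p_i)*h(0_i, p)."""
    if connected(working, s, t):
        return 1.0                      # s and t already communicate
    if not connected(tuple(working) + tuple(arcs), s, t):
        return 0.0                      # no undecided arc can restore a path
    arc, p = arcs[0], probs[0]
    up = h(arcs[1:], probs[1:], s, t, tuple(working) + (arc,))   # arc works
    down = h(arcs[1:], probs[1:], s, t, working)                 # arc failed
    return p * up + (1 - p) * down


reliability = h(ARCS, [0.9] * 5)
```

For equal arc reliabilities the bridge network has the known reliability polynomial $2p^2 + 2p^3 - 5p^4 + 2p^5$, which gives a convenient check on the sketch.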

In Figure 2 no series or parallel reductions are possible, so we pivot on arc 1.
That is, we short arc 1 on the left and delete arc 1 on the right. Series and
parallel reductions are now possible on the two modified graphs. After performing
these reductions, we again pivot. In our binary computational tree in Figure 3 there
are 4 leaves at the bottom of the tree. Neglecting parallel and series reductions
except at the last stage, we have performed only 2(4) - 1 = 7 operations to achieve
our reliability computation. If each arc i has probability p of working, it is easy
to see that the system reliability in this case is

\begin{align*}
P\{s \text{ can communicate with the terminal } t\} &= h(p) \\
&= p^2\bigl(((((p \amalg p)p) \amalg p)p) \amalg p\bigr) + p(1-p)\bigl(((p \amalg p)p)(p^2 \amalg p)\bigr) \\
&\quad + p(1-p)\bigl((p(p \amalg p)) \amalg p^2\bigr)p + (1-p)^2\bigl((p^3 \amalg p)p^2\bigr) \\
&= (p^3 + p^4 + p^5 - 5p^6 + 4p^7 - p^8) + (2p^4 - p^5 - 4p^6 + 4p^7 - p^8) \\
&\quad + (3p^4 - 4p^5 - p^6 + 3p^7 - p^8) + (p^3 - 2p^4 + 2p^5 - 3p^6 + 3p^7 - p^8) \\
&= 2p^3 + 4p^4 - 2p^5 - 13p^6 + 14p^7 - 4p^8 .
\end{align*}
The lower case "ip" operator, $\amalg$, corresponds to calculating the reliability of
parallel arcs; i.e.,

\[
p_i \amalg p_j = p_i + p_j - p_i p_j .
\]

In Figure 3, the complementary probability is $1 - p_i$.
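The arithmetic in the expansion of h(p) can be checked mechanically. The following sketch represents polynomials in p as coefficient lists (index k holds the coefficient of $p^k$); the helper names `mul`, `add` and `ip` are illustrative choices of this sketch, and the four terms are transcribed from the four leaves of the computation above.

```python
def mul(a, b):
    """Product of two polynomials given as coefficient lists."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def add(a, b):
    """Sum of two polynomials given as coefficient lists."""
    n = max(len(a), len(b))
    return [(a[k] if k < len(a) else 0) + (b[k] if k < len(b) else 0)
            for k in range(n)]


def ip(a, b):
    """Parallel ("ip") combination: a + b - a*b."""
    return add(add(a, b), [-c for c in mul(a, b)])


P = [0, 1]            # the polynomial p
Q = [1, -1]           # the polynomial 1 - p
P2 = mul(P, P)
P3 = mul(P2, P)

# The four leaves of the binary computational tree, in the order given above.
term1 = mul(P2, ip(mul(ip(mul(ip(P, P), P), P), P), P))
term2 = mul(mul(P, Q), mul(mul(ip(P, P), P), ip(P2, P)))
term3 = mul(mul(P, Q), mul(ip(mul(P, ip(P, P)), P2), P))
term4 = mul(mul(Q, Q), mul(ip(P3, P), P2))

total = add(add(term1, term2), add(term3, term4))
```

The coefficients in `total` reproduce the final polynomial, and they sum to 1, as they must since h(1) = 1.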

Linear  and  polynomial  time  algorithms  are  now  available  for  computing  network 
reliability  when  the  underlying  graph  has  a  series-parallel  topology.  For  example, 
the  graph  in  Figure  4  is  called  a  topologically  series-parallel  graph  even  though 
the  same  graph  in  Figure  2  with  distinguished  nodes  s  and  t  is  not  series-parallel 
with  respect  to  reliability  computation. 


FIGURE  4.  A  TOPOLOGICAL  SERIES-PARALLEL  GRAPH 
(NO  DISTINGUISHED  NODES) 


For undirected networks, the basic reference is A. Satyanarayana and Kevin Wood (1982).
For directed networks, the basic reference is Avinash Agrawal and A. Satyanarayana
(1982).

If the arc reliability, p, is unknown but there is data available, then we may
assess our uncertainty about p by a probability density. If a Beta prior density
is used, then the posterior density is also Beta and is of the form

\[
\pi(p \mid a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\,\Gamma(b)}\, p^{a-1} (1-p)^{b-1} .
\]

Our final system reliability assessment is now

\[
R = \int_0^1 h(p)\, \pi(p \mid a, b)\, dp .
\]

A common mistake is to compute the expected arc reliability

\[
\int_0^1 p\, \pi(p \mid a, b)\, dp = \frac{a}{a+b}
\]

and compute $h\bigl(\tfrac{a}{a+b}\bigr)$. However, in general $R \neq h\bigl(\tfrac{a}{a+b}\bigr)$.
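The gap between R and the plug-in value can be seen numerically. The sketch below assumes the Beta(a, b) parametrization with mean a/(a+b), uses the polynomial h(p) obtained above for the network of Figure 2, and picks hypothetical posterior parameters a = 9, b = 1 purely for illustration. Because h is a polynomial, R follows exactly from the Beta moments $E[p^k] = \prod_{j=0}^{k-1} (a+j)/(a+b+j)$.

```python
from fractions import Fraction

# Coefficients of h(p) = 2p^3 + 4p^4 - 2p^5 - 13p^6 + 14p^7 - 4p^8 (index = power).
H = [0, 0, 0, 2, 4, -2, -13, 14, -4]


def h(p):
    """Evaluate the system reliability polynomial at arc reliability p."""
    return sum(c * p ** k for k, c in enumerate(H))


def beta_moment(k, a, b):
    """k-th moment of a Beta(a, b) density."""
    m = Fraction(1)
    for j in range(k):
        m *= Fraction(a + j, a + b + j)
    return m


def posterior_reliability(a, b):
    """R = integral of h(p) * pi(p | a, b) dp, computed term by term."""
    return sum(c * beta_moment(k, a, b) for k, c in enumerate(H))


a, b = 9, 1                              # hypothetical posterior parameters
R = posterior_reliability(a, b)          # correct assessment
plug_in = h(Fraction(a, a + b))          # the "common mistake"
```

With these illustrative parameters the plug-in value overstates the system reliability; the direction of the error depends on the shape of h and on the posterior.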


Logic Trees

Logic  tree  (or  fault  tree)  analysis  is  a  detailed  deductive  analysis  that  usually 
requires  considerable  system  information.  It  is  best  applied  during  the  design  stages 
of  a  system.  At  that  point,  it  can  identify  hazardous  conditions  and  potential  acci¬ 
dents  in  a  system  design  and  thus  can  help  eliminate  costly  design  changes  and  retro¬ 
fits  that  would  otherwise  have  to  be  made  later  in  the  system  life  cycle.  Undesired 
events  requiring  logic  tree  analysis  are  identified  either  by  inductive  analysis  or 
by  intuition.  These  events  are  usually  undesired  system  states  that  can  occur  as  a 
result  of  subsystem  functional  faults. 

A  logic  tree  is  a  model  that  graphically  and  logically  represents  the  various 
combinations  of  possible  events,  both  fault  and  normal,  occurring  in  a  system  that 
lead  to  the  top  undesired  event.  The  logic  tree  is  so  structured  that  the  undesired 
event  appears  as  the  top  event  in  the  logic  tree.  The  sequences  of  events  that  lead 
to  the  undesired  event  are  shown  below  the  top  event  and  are  logically  linked  to  the 
undesired  event  by  standard  OR  and  AND  gates.  The  input  events  to  each  logic  gate 
that  are  also  outputs  of  other  logic  gates  at  a  lower  level  are  shown  as  rectangles. 
(Rectangles  are  called  gate  events.)  These  events  are  developed  even  further  until 
the  sequences  of  events  lead  to  basic  causes.  The  basic  events  appear  as  circles  and 
diamonds  on  the  bottom  of  the  fault  tree  and  represent  the  limit  of  resolution.  The 
circle  represents  an  internal  or  primary  failure  of  a  system  element  when  exercised 
within the design envelope of the system. The diamond represents a failure, other
than  a  primary  failure,  that  is  purposely  not  further  developed.  Gate  nodes  corre¬ 
spond  to  intermediate  events  while  the  top  node  usually  corresponds  to  a  very  serious 
system  failure  event.  In  Figure  6,  all  arcs  are  regular  with  the  exception  of  the 
complementing  arc  joining  nodes  6  and  4,  and  this  arc  is  distinguished  by  the  symbol 


Associated with each gate is a logic symbol: OR gates have a plus symbol (for
set union) while AND gates have a product (•) symbol (for set intersection). For
example, output event 3 occurs if either input event 4 or 5 (or both) occur. Likewise
output event 5 occurs only if both input events 7 and 8 occur. Since the arc
connecting gate events 4 and 6 is complemented, gate event 4 occurs only if basic
event 11 occurs and gate event 6 does not occur.

A complete reliability analysis on an extensive system such as the SXTF System
normally requires three levels of fault tree development, as shown in Figure 5. The
upper level, called the top structure, includes the top undesired event and the
sub-undesired events that are potential accidents and hazardous conditions that are
immediate causes of the top event. The next level of the logic tree divides the
operation of the system into phases, subphases, etc., until the system environment
remains constant and the system characteristics do not change the fault environment.
In this second level of fault tree development, the analyst examines system elements
from a functional point of view. He uses a structuring process to develop fault flows
within the system that deductively lead to a subsystem and detailed hardware flow,
which is the third level of the fault tree. At the third level, the analyst is faced
with one of the most difficult aspects of logic tree analysis. He must determine if
basic events are statistically independent. He then focuses his attention on common
events that can simultaneously fail two or more system elements. The effects of any
common environmental or operational stresses are studied, as well as the effects of
the human factor in the testing, maintenance, and operation of the system.


FIGURE 5. LEVELS OF LOGIC TREE DEVELOPMENT

Once the logic tree is constructed, all logically possible accident scenarios
(called minimal cuts) can be obtained. There are many algorithms and computer
programs for finding minimal cuts (or prime implicants for general logic trees). One
of the best of these is a computer program called FTAP due to Randall Willie (1978).
The minimal cuts can then be used to compute the probability of gate events including
the TOP event. A sensitivity analysis can be performed using a so-called marginal
importance measure which is essentially the partial derivative of system reliability
with respect to component reliability.


Mathematics  of  Fault  Tree  Analysis 

Boolean switching theory is basic for the mathematics of fault tree analysis.
For the fault tree node set $U = [1, 2, \ldots, q]$, let $x_1, x_2, \ldots, x_q$ be Boolean
variables assuming values 0 or 1 and let $\mathbf{x} = (x_1, x_2, \ldots, x_q)$. (In Figure 6,
$q = 14$.) For any u in U, let $x_{-u} = 1 - x_u$. The index set for complements is
$-U = [-1, -2, \ldots, -q]$ and $(u, -u)$ is a complementary pair of indices.

Expressions may be formed using $x_1, \ldots, x_q, x_{-1}, \ldots, x_{-q}$ and the ordinary
Boolean relations of product and sum. An arbitrary nonempty family $\mathcal{I}$ of subsets
of $U \cup (-U)$ (not necessarily distinct) is identified with the Boolean sum-of-products
expression

\[
\sum_{I \in \mathcal{I}} \prod_{i \in I} x_i , \tag{2.5}
\]

where I is a member of the family $\mathcal{I}$. (Remember, the arithmetic is Boolean.) The
notation $/\mathcal{I}/\mathbf{x}$ denotes the value of this expression for a given vector
$\mathbf{x}$ of 0's and 1's, that is,

\[
/\mathcal{I}/\mathbf{x} \equiv \max_{I \in \mathcal{I}} \Bigl( \min_{i \in I} x_i \Bigr) = \sum_{I \in \mathcal{I}} \prod_{i \in I} x_i . \tag{2.6}
\]

Given nonempty families $\mathcal{I}$ and $\mathcal{J}$ of subsets of $U \cup (-U)$,
$/\mathcal{I}/ = /\mathcal{J}/$ means that for all $\mathbf{x}$,
$/\mathcal{I}/\mathbf{x} = /\mathcal{J}/\mathbf{x}$. It is further assumed that no set of a
family contains a complementary pair. Whenever a new family is constructed, any set
containing complementary pairs is simply eliminated.

A family is said to be minimal if all sets are distinct and for any two sets of
the family, neither is a subset of the other. For any family $\mathcal{I}$, let
$m(\mathcal{I})$ (the "minimization" of $\mathcal{I}$) be the minimal family obtained by
eliminating duplicate sets and those which contain another set of $\mathcal{I}$. For
instance, $m([\{2,3\}, \{1,2,3\}]) = [\{2,3\}]$. Of course, for any $\mathcal{I}$,
$/m(\mathcal{I})/ = /\mathcal{I}/$. The first task of a fault tree
analysis is to obtain a certain minimal family of sets of $U \cup (-U)$ called a prime
implicant family. We are only interested in prime implicant families for fault tree
nodes which we wish to analyze since such families are unique and determine the Boolean
expression for the node indicator. For Figure 6 and node 1,

\[
\mathcal{P} = [\{9,10\}, \{12,14\}, \{13\}, \{11\}]
\]

is a prime implicant family and

\[
x_1 = \sum_{P \in \mathcal{P}} \prod_{i \in P} x_i \tag{2.7}
\]

where P is a member of the family $\mathcal{P}$ and $x_1$ is the indicator for the top event
in Figure 6. The first task of a fault tree analysis is to obtain the prime implicant
families for fault tree nodes of special interest.

For trees without complemented arcs, the prime implicants are called minimal cut
sets. The minimal cut set family for a large fault tree (having, say, more than 100
gate nodes) may consist of millions of sets if the tree has an appreciable number of
OR-type gates. A. Rosenthal (1975) has shown that the general problem of finding the
complete minimal cut set family associated with a fault tree is a member of the class
of NP-complete problems. (A class of problems for which it is conjectured that no
algorithm exists which will always run on a computer within a polynomial time bound.)
Hence we cannot expect to devise an algorithm whose running time is bounded for all
fault trees by a polynomial in, say, the number of fault tree nodes. The serious
analyst should probably not rely on the same method for every fault tree.


Sensitivity  Analysis  for  Coherent  Systems  and  Logic  Trees 

Often the relative importance ranking of components in a coherent system (or of
basic events in a logic tree) is more useful than the probability of system success
or failure. We will use coherent system terminology to illustrate the concept of
marginal importance. We define the marginal importance, $I_h(i)$, of component i
to be

\[
I_h(i) = \frac{\partial h(\mathbf{p})}{\partial p_i} \tag{2.8}
\]

when components are statistically independent and $h(\mathbf{p})$ is the system reliability.
From the pivotal decomposition in (2.2), it is clear that

\[
\frac{\partial h(\mathbf{p})}{\partial p_i} = h(1_i, \mathbf{p}) - h(0_i, \mathbf{p}) . \tag{2.9}
\]

This is also valid for general logic trees where $h(\mathbf{p})$ is the probability of the
Top Event. If, in addition, $\phi$ is nondecreasing coordinatewise,

\[
I_h(i) = \frac{\partial h(\mathbf{p})}{\partial p_i} = P[\phi(1_i, \mathbf{X}) - \phi(0_i, \mathbf{X}) = 1] \tag{2.10}
\]

so that $I_h(i)$ is the probability that component i is "critical" at a given time
instant. This means that with i working the system works, but with i failed, the
system is failed.

The reliability importance of components may be used to evaluate the effect of
an improvement in component reliability on system reliability, as follows. By the
chain rule for differentiation,

\[
\frac{dh}{dt} = \sum_{j=1}^{n} \frac{\partial h}{\partial p_j}\, \frac{dp_j}{dt} ,
\]

where t is a common parameter, say, the time elapsed since system development began.
Using (2.8), we have

\[
\frac{dh}{dt} = \sum_{j=1}^{n} I_h(j)\, \frac{dp_j}{dt} . \tag{2.11}
\]

Thus the rate at which system reliability grows is a weighted combination of the rates
at which component reliabilities grow, where the weights are the reliability importance
numbers.

From (2.11), we may also obtain

\[
\Delta h \approx \sum_{j=1}^{n} I_h(j)\, \Delta p_j , \tag{2.12}
\]

where $\Delta h$ is the perturbation in system reliability corresponding to perturbations
$\Delta p_j$ in component reliabilities. As in (2.11), the reliability importance numbers
enter as weights. Thus small improvements $\Delta p_j$ in component reliabilities lead to
a corresponding improvement $\Delta h$ in system reliability in accordance with (2.12).
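As a small numerical illustration of (2.12), the sketch below compares the first-order estimate with the exact change in reliability. The 2-out-of-3 system, the component reliabilities and the perturbations are all hypothetical choices made for this example; the importance numbers are computed from (2.9).

```python
def h(p):
    """Reliability of a hypothetical 2-out-of-3 system with independent components."""
    p1, p2, p3 = p
    return p1 * p2 + p1 * p3 + p2 * p3 - 2 * p1 * p2 * p3


def importance(p, j):
    """Marginal importance I_h(j) = h(1_j, p) - h(0_j, p)."""
    up = list(p)
    up[j] = 1.0
    down = list(p)
    down[j] = 0.0
    return h(up) - h(down)


p = [0.90, 0.85, 0.80]          # hypothetical component reliabilities
dp = [0.01, 0.00, 0.02]         # hypothetical small improvements

approx = sum(importance(p, j) * dp[j] for j in range(3))      # right side of (2.12)
exact = h([pj + d for pj, d in zip(p, dp)]) - h(p)            # true change in h
```

Since h is multilinear in the component reliabilities, the difference between `approx` and `exact` consists only of cross terms in the perturbations, which are of second order and higher.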


Examples:

Assume components have been labeled so that component reliabilities are ordered
as follows:

\[
p_1 \le p_2 \le \cdots \le p_n .
\]

(a) Series System. If $h(\mathbf{p}) = \prod_{i=1}^{n} p_i$, then

\[
I_h(j) = \prod_{i \ne j} p_i
\]

and $I_h(1) \ge I_h(2) \ge \cdots \ge I_h(n)$, so that the component with lowest
reliability is the most important to the system. This reflects the well-known
principle that "a chain is as strong as its weakest link."

(b) Parallel System. If $h(\mathbf{p}) = \coprod_{i=1}^{n} p_i$, then

\[
I_h(j) = \prod_{i \ne j} (1 - p_i)
\]

and $I_h(1) \le I_h(2) \le \cdots \le I_h(n)$, so that the component with highest
reliability is the most important to the system. This, too, is intuitively
reasonable, since if just one component functions, the system functions.
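Both orderings can be confirmed numerically from (2.9). In this sketch the three component reliabilities are hypothetical, chosen only to satisfy $p_1 \le p_2 \le p_3$, and the function names are illustrative.

```python
from math import prod


def series(p):
    """Reliability of a series system."""
    return prod(p)


def parallel(p):
    """Reliability of a parallel system."""
    return 1 - prod(1 - pi for pi in p)


def importance(h, p, j):
    """Marginal importance I_h(j) = h(1_j, p) - h(0_j, p)."""
    up = p[:j] + [1.0] + p[j + 1:]
    down = p[:j] + [0.0] + p[j + 1:]
    return h(up) - h(down)


p = [0.80, 0.90, 0.95]          # hypothetical, ordered p_1 <= p_2 <= p_3
series_imp = [importance(series, p, j) for j in range(len(p))]
parallel_imp = [importance(parallel, p, j) for j in range(len(p))]
```

Here `series_imp` is decreasing in the component index and `parallel_imp` is increasing, in agreement with examples (a) and (b).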

The concept of marginal importance plays a key role in a computer program
called PAFT [T. Barlow and K. Wood (1982)] for analyzing logic trees. This program
calculates the probability of all gate events given the probabilities for basic events.
It then calculates the marginal importances of all gate and basic events relative to
the Top Event. Given failure rates for basic events and using the marginal
importances, the program also calculates marginal occurrence rates for basic events
relative to the Top Event. The Top Event occurrence rate for selected time points is
then calculated as the sum of basic event marginal occurrence rates.

This program attempts to take optimum advantage of the tree structure for the
probability calculation. It neither finds nor uses minimal cut sets for this purpose.
3. SYSTEM AVAILABILITY ANALYSIS


In most system reliability analyses, it is necessary to evaluate the effect of
maintenance procedures on overall system availability and performance. For example,
the following questions are of interest:

1. What is the long run expected time average of the number of system failures?

2. What is the long run expected average of system up (down) times?

3. How often do we expect a specific component to "cause" system failure?

(We say that a component "causes" system failure if system failure coincides
with that component's failure.)

We will consider two system models of general interest.

MODEL  A:  Coherent  Systems  with  Separately  Maintained  Components 

For this model, failure-repair processes in different system component positions
are assumed to be statistically independent. This is somewhat unrealistic since we
also suppose that functioning components continue to operate (and perhaps fail) even
when the system is down.

MODEL  B:  Series  Systems  Whose  Functioning  Components  Suspend  Operation  During  Repair 

This model represents the other extreme relative to MODEL A. In this case
functioning components are in "suspended animation," so to speak, when the system is
down. While in "suspended animation" components do not age and cannot fail.

For both models, we assume continuous failure and repair distributions. Let
$N_i(t)$ be the number of times in $[0,t]$ that component i "causes" system failure.
Let

\[
N(t) = \sum_{i=1}^{n} N_i(t)
\]

be the number of system failures in $[0,t]$ where n is the number of system
components. We call $EN(t)/t$ the expected time average of the number of system
failures in $[0,t]$. In general, it will be time dependent. When

\[
\lim_{t \to \infty} \frac{EN(t)}{t}
\]

exists, we call this the long run expected time average of the number of system
failures.


MODEL  A:  LONG  RUN  PERFORMANCE  FORMULAS 


Most of the formulas which answer the previous three questions involve computing
the reliability function, $h(\mathbf{p})$, discussed in Section 2. Although this function is
based on the binary case (components are either working or failed at a specific time),
it also plays a crucial role in the dynamic, time dependent case. Under Model A, we
have an alternating renewal process for each component position of our coherent
system, $\phi$. Let component type i have mean life $\mu_i$ and mean repair time $\nu_i$. Let

\[
X_i(t) = \begin{cases} 1 & \text{if component } i \text{ is working at time } t, \\ 0 & \text{otherwise.} \end{cases}
\]

Then the system indicator function is

\[
X(t) = \phi[X_1(t), X_2(t), \ldots, X_n(t)]
\]

and

\[
A(t) = P[X(t) = 1] = E[X(t)] = h[A_1(t), A_2(t), \ldots, A_n(t)] \tag{3.1}
\]

where $A_i(t)$ is the probability that component i is available at time t. The
long run availability is

\[
\lim_{t \to \infty} A(t) = h\Bigl[\frac{\mu_1}{\mu_1 + \nu_1}, \frac{\mu_2}{\mu_2 + \nu_2}, \ldots, \frac{\mu_n}{\mu_n + \nu_n}\Bigr] . \tag{3.2}
\]


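The number of failures in component position i, $\tilde{N}_i(t)$, generates a (delayed)
renewal counting process $\{\tilde{N}_i(t);\ t \ge 0\}$. Let $M_i(t) = E\tilde{N}_i(t)$. It is
proved in Barlow and Proschan (1975) that the expected number of system failures in
$[0,t]$ caused by component i is

\[
EN_i(t) = \int_0^t \bigl[h(1_i, \mathbf{A}(u)) - h(0_i, \mathbf{A}(u))\bigr]\, dM_i(u) , \tag{3.3}
\]

where $\mathbf{A}(u) = (A_1(u), \ldots, A_n(u))$. From this result it can be shown that

\[
\lim_{t \to \infty} \frac{1}{t}\, EN_i(t) = \frac{h(1_i, \mathbf{A}) - h(0_i, \mathbf{A})}{\mu_i + \nu_i} \tag{3.4}
\]

where $\mathbf{A} = (A_1, A_2, \ldots, A_n)$ and $A_i = \mu_i/(\mu_i + \nu_i)$. The long run
expected time average of the number of system failures is then

\[
\lim_{t \to \infty} \frac{EN(t)}{t} = \sum_{i=1}^{n} \frac{h(1_i, \mathbf{A}) - h(0_i, \mathbf{A})}{\mu_i + \nu_i} . \tag{3.5}
\]

If $U_1, U_2, \ldots, U_k$ are successive system uptimes, then it can be shown that

\[
\lim_{k \to \infty} \frac{E[U_1 + U_2 + \cdots + U_k]}{k} = \frac{h(\mathbf{A})}{\sum_{i=1}^{n} \bigl[h(1_i, \mathbf{A}) - h(0_i, \mathbf{A})\bigr]/(\mu_i + \nu_i)} . \tag{3.6}
\]

If $D_1, D_2, \ldots, D_k$ are successive system downtimes, then

\[
\lim_{k \to \infty} \frac{E[D_1 + \cdots + D_k]}{k} = \frac{1 - h(\mathbf{A})}{\sum_{i=1}^{n} \bigl[h(1_i, \mathbf{A}) - h(0_i, \mathbf{A})\bigr]/(\mu_i + \nu_i)} , \tag{3.7}
\]

the long run average of system downtimes.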


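Example: Series System


In this case, the long run expected time average of the number of system failures
is

\[
\Bigl[\prod_{j=1}^{n} \frac{\mu_j}{\mu_j + \nu_j}\Bigr] \sum_{i=1}^{n} \frac{1}{\mu_i} \tag{3.8}
\]

while

\[
\lim_{k \to \infty} \frac{E[U_1 + \cdots + U_k]}{k} = \Bigl(\sum_{i=1}^{n} \frac{1}{\mu_i}\Bigr)^{-1} \tag{3.9}
\]

and letting $\mu = \bigl(\sum_{i=1}^{n} 1/\mu_i\bigr)^{-1}$ and
$A = \prod_{j=1}^{n} \mu_j/(\mu_j + \nu_j)$,

\[
\lim_{k \to \infty} \frac{E[D_1 + \cdots + D_k]}{k} = \frac{1 - \prod_{j=1}^{n} \frac{\mu_j}{\mu_j + \nu_j}}{\Bigl[\prod_{j=1}^{n} \frac{\mu_j}{\mu_j + \nu_j}\Bigr] \sum_{i=1}^{n} \frac{1}{\mu_i}} = \mu \Bigl(\frac{1 - A}{A}\Bigr) . \tag{3.10}
\]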
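The series-system formulas can be checked against the general Model A expression (3.5). The sketch below is illustrative only: the function names are hypothetical, and the mean lives and mean repair times are borrowed from the Marx, waterline and tube rows of Table 1 purely as sample numbers (treating those subsystems under Model A is an assumption of this example).

```python
from math import prod


def series_h(a):
    """Reliability function of a series system."""
    return prod(a)


def failure_rate_general(h, mu, nu):
    """Long run system failures per unit time under Model A, as in (3.5)."""
    avail = [m / (m + v) for m, v in zip(mu, nu)]
    total = 0.0
    for i, (m, v) in enumerate(zip(mu, nu)):
        up = avail[:i] + [1.0] + avail[i + 1:]
        down = avail[:i] + [0.0] + avail[i + 1:]
        total += (h(up) - h(down)) / (m + v)
    return total


def series_summary(mu, nu):
    """Closed forms (3.8)-(3.10) for a series system."""
    A = prod(m / (m + v) for m, v in zip(mu, nu))
    inv_mu = sum(1.0 / m for m in mu)
    return {
        "availability": A,
        "failure_rate": A * inv_mu,              # (3.8)
        "mean_up": 1.0 / inv_mu,                 # (3.9)
        "mean_down": (1.0 - A) / (A * inv_mu),   # (3.10)
    }


mu = [800.0, 1000.0, 400.0]     # mean lives in hours (sample values)
nu = [8.0, 18.0, 24.0]          # mean repair times in hours (sample values)
summary = series_summary(mu, nu)
general = failure_rate_general(series_h, mu, nu)
```

The value `general` agrees with `summary["failure_rate"]`, and the ratio of mean uptime to mean uptime plus mean downtime recovers the long run availability.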


MODEL  B:  LONG  RUN  PERFORMANCE  FORMULAS 


Under this model, functioning components suspend operation during repair of
nonfunctioning components. If any component in this series system fails, the remaining
components are shut off and remain in suspended animation until the failed component
is fixed.

Let U(t) [D(t)] be the cumulated system uptime [downtime] by time t. In
Barlow and Proschan (1975), Chapter 7, it is shown that the long run average system
availability is, in this case,

\[
A_{av} = \lim_{t \to \infty} \frac{1}{t} \int_0^t A(u)\, du = \Bigl(1 + \sum_{j=1}^{n} \frac{\nu_j}{\mu_j}\Bigr)^{-1} . \tag{3.11}
\]

If $\lim_{t \to \infty} A(t)$ exists, then it is the same as (3.11).

The limiting expected time average of the number of system failures caused by
component i is

\[
\lim_{t \to \infty} \frac{EN_i(t)}{t} = \frac{A_{av}}{\mu_i} . \tag{3.12}
\]

Hence, the long run expected time average of the number of system failures is

\[
A_{av} \sum_{i=1}^{n} \frac{1}{\mu_i} . \tag{3.13}
\]

The long run average of system uptimes is

\[
\lim_{k \to \infty} \frac{E[U_1 + U_2 + \cdots + U_k]}{k} = \Bigl(\sum_{i=1}^{n} \frac{1}{\mu_i}\Bigr)^{-1} . \tag{3.14}
\]

The long run average of system downtimes is

\[
\lim_{k \to \infty} \frac{E[D_1 + D_2 + \cdots + D_k]}{k} = \Bigl(\sum_{i=1}^{n} \frac{1}{\mu_i}\Bigr)^{-1} \sum_{j=1}^{n} \frac{\nu_j}{\mu_j} . \tag{3.15}
\]


Compare  (3.8)  and  (3.13);  also  (3.9)  and  (3.14);  also  (3.10)  and  (3.15). 
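The suggested comparison can be made concrete with a short computation. In the sketch below the function name `model_b_series` is hypothetical, the sample means are again taken from the Marx, waterline and tube rows of Table 1, and the downtime average is taken to be the uptime average multiplied by $\sum_j \nu_j/\mu_j$, which is what (3.11) and (3.14) together imply.

```python
def model_b_series(mu, nu):
    """Long run quantities for a series system under Model B."""
    ratio = sum(v / m for m, v in zip(mu, nu))      # sum of nu_j / mu_j
    inv_mu = sum(1.0 / m for m in mu)               # sum of 1 / mu_i
    a_av = 1.0 / (1.0 + ratio)                      # (3.11)
    return {
        "availability": a_av,
        "failure_rate": a_av * inv_mu,              # (3.13)
        "mean_up": 1.0 / inv_mu,                    # (3.14)
        "mean_down": ratio / inv_mu,                # (3.15), as assumed above
    }


mu = [800.0, 1000.0, 400.0]     # mean lives in hours (sample values)
nu = [8.0, 18.0, 24.0]          # mean repair times in hours (sample values)
model_b = model_b_series(mu, nu)

# Model A long run availability for the same series system, from (3.2).
model_a_availability = 1.0
for m, v in zip(mu, nu):
    model_a_availability *= m / (m + v)
```

The average uptime is the same under both models, while the availability under Model B is at least as large as under Model A, because components cannot fail while the system is down.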


Availability  of  Series  Systems  with  Preventive  Maintenance 


Most systems are subject to planned maintenance. In calculating system
availability, it seems unfair that planned maintenance downtime should count against good
system performance. Hence, we define $A_{\text{failure}}$ as long run system availability when
downtime due to routine maintenance is not considered as contributing to system
unavailability. Conversely, system failure unavailability is, in the long run, the
fraction of time the system is down due to a component or subsystem failure. The
following discussion shows how this fraction (or percentage) may be computed.

The Pulse Radiation Source (an X-ray system) is basically a series system of
five major subsystems: Marx, waterline, tube, source, and source shield. Each
subsystem has a prescribed time between scheduled maintenance and a maintenance
downtime (see Table 1). In addition, system failures may cause additional unscheduled
maintenance downtime. When scheduled or unscheduled maintenance is performed on a
subsystem, the other subsystems are said to be in suspended animation. When the
subsystem is maintained or repaired, all subsystems resume normal operation. For the
purpose of availability analysis, we assume that maintained or repaired subsystems
are "like new."

A table of maintenance downtimes, failure repair downtimes, and subsystem mean
times to failure follows (Table 1). Since there are four shots per 8-hour work day,
we let 1 system shot equal 2 hours and all times in the table are expressed in hours.

TABLE 1

PULSE RADIATION SOURCE
MAINTENANCE AND FAILURE INFORMATION

Subsystem       Maintenance Frequency       Maintenance   Mean Time To           Failure Repair Mean
                                            Downtime      Failure $\mu_i$        Downtime $\nu_i$

Marx            After 50 shots or 100 h     8 h           400 shots or 800 h     8 h
Waterline       After 50 shots or 100 h     8 h           500 shots or 1000 h    16 h
Tube            After 50 shots or 100 h     8 h           200 shots or 400 h     24 h
                After 5 shots or 10 h       4 h
Source/Shield   After every shot or 2 h     1 h           —                      —

Since  the  source/shield  is  repaired  after  every  shot,  no  mean  time  to  failure 
is  assessed.  Since  the  Marx,  Waterline  and  Tube  are  periodically  maintained,  the 
failure  rate  is  considered  constant  (one  divided  by  the  mean  time  to  failure). 

There is a natural 100 h operating cycle for the maintenance regime. Since after
10 h, 100 h, etc. more than one subsystem is serviced, each downtime corresponds to the
longest required service time. In a 100 h operating cycle, the total maintenance
downtime accumulated is

\[
T = 1.0\,[50 - 10] + 4\,[10 - 1] + 8\,[1] = 84 \text{ hours}.
\]
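The count can be reproduced by stepping through the 50 shots of one operating cycle and charging, at each shot, only the longest service that falls due. The sketch below is a minimal illustration in Python. The names `SERVICES` and `maintenance_downtime` are ours, and the schedule is the one read from Table 1 (one shot equals 2 operating hours).

```python
# Scheduled services as (interval in shots, downtime in hours).
SERVICES = [
    (1, 1.0),    # source/shield, after every shot
    (5, 4.0),    # tube, after every 5 shots (10 h)
    (50, 8.0),   # Marx, waterline and tube, after every 50 shots (100 h)
]
SHOTS_PER_CYCLE = 50  # 100 operating hours at 2 h per shot


def maintenance_downtime(services, shots_per_cycle):
    """Total planned downtime in one cycle, charging only the longest
    service whenever several services coincide."""
    total = 0.0
    for shot in range(1, shots_per_cycle + 1):
        due = [down for every, down in services if shot % every == 0]
        total += max(due, default=0.0)
    return total


T = maintenance_downtime(SERVICES, SHOTS_PER_CYCLE)
print(T)  # 84.0
```

Of the 50 service stops, 40 cost 1 h, 9 cost 4 h and 1 costs 8 h, which is the bracketed count in the formula for $T$.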


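Let t be calendar time (in units of working hours) and U(t) the cumulated
system uptime in calendar time t. Let $D_{ir}$ be the r-th downtime to repair a
failure of subsystem i, and let $N_i[U(t)]$ be the number of such failures during the
uptime U(t). Then overall system availability, $A_0$, in the long run is

\[
A_0 = \lim_{t\to\infty} \frac{U(t)}{U(t) + \dfrac{U(t)}{100}\,T + \sum_{i=1}^{k} \sum_{r=1}^{N_i[U(t)]} D_{ir}} \tag{3.16}
\]

since $\dfrac{U(t)}{100}\,T$ will be approximately the downtime due to preventive maintenance in
calendar time t. In our example, k = 3 corresponding to the Marx, waterline
and the tube. Hence

\[
A_0 = \Bigl[1 + \frac{T}{100} + \sum_{i=1}^{k} \frac{\nu_i}{\mu_i}\Bigr]^{-1} = 0.519
\]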


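which looks bad! However, if we only count downtime due to failures, then long run
availability in this case, called $A_{\text{failure}}$, is

\[
A_{\text{failure}} = \lim_{t\to\infty} \frac{U(t) + \dfrac{U(t)}{100}\,T}{U(t) + \dfrac{U(t)}{100}\,T + \sum_{i=1}^{k} \sum_{r=1}^{N_i[U(t)]} D_{ir}}
= \frac{1 + T/100}{1 + \dfrac{T}{100} + \sum_{i=1}^{k} \dfrac{\nu_i}{\mu_i}} = 0.955. \tag{3.17}
\]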
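The two availabilities can be checked numerically from the closed forms. The sketch below is an illustration only, with variable names of our choosing. It takes T = 84 h per 100 operating hours and the mean times to failure and repair downtimes of Table 1, assuming a waterline repair downtime of 16 h, the value that reproduces the figures 0.519 and 0.955 quoted in the text.

```python
T = 84.0       # planned maintenance downtime per operating cycle, hours
CYCLE = 100.0  # operating hours per cycle

# subsystem: (mean time to failure mu_i, mean repair downtime nu_i), hours.
# The waterline repair downtime of 16 h is an assumption (see lead-in).
SUBSYSTEMS = {
    "Marx": (800.0, 8.0),
    "Waterline": (1000.0, 16.0),
    "Tube": (400.0, 24.0),
}

repair_ratio = sum(nu / mu for mu, nu in SUBSYSTEMS.values())  # sum nu_i/mu_i
maint_ratio = T / CYCLE                                        # T/100

# Overall availability: every kind of downtime counts against the system.
A_0 = 1.0 / (1.0 + maint_ratio + repair_ratio)

# Failure availability: planned maintenance is not counted as unavailability.
A_failure = (1.0 + maint_ratio) / (1.0 + maint_ratio + repair_ratio)

print(f"A_0 = {A_0:.3f}, A_failure = {A_failure:.3f}")
```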


Let $A_{\text{maint.}}$ be the long run system availability with respect to planned
maintenance (i.e., the fraction of time the system is not down due to planned maintenance).

\[
A_{\text{maint.}} = \lim_{t\to\infty} \frac{U(t)}{U(t) + \dfrac{U(t)}{100}\,T} = \Bigl[1 + \frac{T}{100}\Bigr]^{-1} = 0.543. \tag{3.18}
\]

Note that from (3.16), (3.17) and (3.18) we have

\[
A_0 = A_{\text{maint.}} \times A_{\text{failure}} .
\]

This  is  valid  assuming  failure  occurrence  is  independent  of  scheduled  maintenance. 
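The product relation can be verified directly from the closed forms of (3.17) and (3.18), since the factor $1 + T/100$ cancels. As a worked check with the numbers of this example:

```latex
A_{\text{maint.}} \times A_{\text{failure}}
  = \frac{1}{1 + T/100} \cdot
    \frac{1 + T/100}{1 + T/100 + \sum_{i=1}^{k} \nu_i/\mu_i}
  = \Bigl[ 1 + \frac{T}{100} + \sum_{i=1}^{k} \frac{\nu_i}{\mu_i} \Bigr]^{-1}
  = A_0 ,
\qquad
0.543 \times 0.955 \approx 0.519 .
```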
If  we  neglect  planned  maintenance  downtimes,  then  from  (3.11),  we  have 


\[
\lim_{t\to\infty} \frac{E[U(t)]}{t} = \Bigl[1 + \sum_{i=1}^{k} \frac{\nu_i}{\mu_i}\Bigr]^{-1} \tag{3.19}
\]


and  from  (3.13)  the  long  run  expected  time  average  of  the  number  of  system  failures 
(neglecting  planned  maintenance  downtimes)  is 


\[
\Bigl[1 + \sum_{i=1}^{k} \frac{\nu_i}{\mu_i}\Bigr]^{-1} \sum_{i=1}^{k} \frac{1}{\mu_i} .
\]


In  100  operating  hours,  we  expect 


\[
100 \Bigl[1 + \sum_{i=1}^{k} \frac{\nu_i}{\mu_i}\Bigr]^{-1} \sum_{i=1}^{k} \frac{1}{\mu_i}
\]


system  failures  so  that  for  a  planned  operating  and  maintenance  cycle  of  100  +  T 
hours  we  expect  the  long  run  average  number  of  failures  per  hour  to  be 


\[
\lambda = \frac{100 \Bigl[1 + \sum_{i=1}^{k} \dfrac{\nu_i}{\mu_i}\Bigr]^{-1} \sum_{i=1}^{k} \dfrac{1}{\mu_i}}{100 + T} = 2.377 \times 10^{-3}/\text{h} \tag{3.20}
\]


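where $\lambda = \lambda_{\text{Marx}} + \lambda_{\text{Waterline}} + \lambda_{\text{Tube}}$ and

\[
\lambda_{\text{Marx}} = 6.256 \times 10^{-4}/\text{h}, \qquad
\lambda_{\text{Waterline}} = 5.005 \times 10^{-4}/\text{h}, \qquad
\lambda_{\text{Tube}} = 1.25 \times 10^{-3}/\text{h} .
\]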


It  is  clear  that  the  Tube,  the  Marx  and  the  Waterline  are  the  most  critical  subsystems 
and  in  that  order. 
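The ranking can be reproduced by splitting the failure frequency of (3.20) into its per-subsystem terms, each proportional to $1/\mu_i$. The following Python sketch is illustrative only. The variable names are ours, and the inputs are T = 84 h and the Table 1 values, again assuming a waterline repair downtime of 16 h, which is the value consistent with the rates quoted above.

```python
T = 84.0       # planned maintenance downtime per operating cycle, hours
CYCLE = 100.0  # operating hours per cycle

# subsystem: (mean time to failure mu_i, mean repair downtime nu_i), hours.
# The waterline repair downtime of 16 h is an assumption (see lead-in).
SUBSYSTEMS = {
    "Marx": (800.0, 8.0),
    "Waterline": (1000.0, 16.0),
    "Tube": (400.0, 24.0),
}

repair_ratio = sum(nu / mu for mu, nu in SUBSYSTEMS.values())  # sum nu_i/mu_i

# Common factor of (3.20): 100 / ((1 + sum nu_i/mu_i) * (100 + T)).
scale = CYCLE / ((1.0 + repair_ratio) * (CYCLE + T))

# Contribution of each subsystem to the long run failures per hour.
rates = {name: scale / mu for name, (mu, _nu) in SUBSYSTEMS.items()}
total_rate = sum(rates.values())

# Most critical subsystem first.
ranking = sorted(rates, key=rates.get, reverse=True)

for name in ranking:
    print(f"{name:<10s} {rates[name]:.3e} /h")
print(f"{'total':<10s} {total_rate:.3e} /h")
```

Because the common factor is the same for every subsystem, the ordering is simply the ordering of the mean times to failure, shortest first.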